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In this work we combine two distinct machine learning methodologies, sequential Monte Carlo and 
Bayesian experimental design, and apply them to the problem of inferring the dynamical parameters 
of a quantum system. We design the algorithm with practicality in mind by including parameters 
that control trade-offs between the requirements on computational and experimental resources. The 
algorithm can be implemented online (during experimental data collection), avoiding the need for 
storage and post-processing. Most importantly, our algorithm is capable of learning Hamiltonian 
parameters even when the parameters change from experiment-to-experiment, and also when ad- 
ditional noise processes are present and unknown. The algorithm also numerically estimates the 
Cramer-Rao lower bound, certifying its own performance. 

I. INTRODUCTION 

Building a large scale quantum information processor is a significant challenge. First, we require an accurate 
characterization of the dynamics experienced by the device to allow for the application of error correcting codes 
and other tools for implementing useful quantum algorithms. Characterization is carried out through the statistical 
estimation of parameters describing the quantum states and processes involved (also called tomography) pQ. This 
characterization problem is especially timely, since quantum simulation experiments are approaching a complexity 
where classical computers are unable to simulate their evolution [2— -4J . In cases where so called analog quantum 
simulation is applied, the validity of the simulation directly depends on the accuracy with which the Hamiltonian 
of the quantum simulation conforms to the dynamical model that the experimenter believes describes the simulator. 
At present, such simulators are certified by comparing their outputs to those expected from a classical-computer 
simulation [3J 0] . This means that new methods for Hamiltonian characterization will be vital for certifying the next 
generation of quantum simulators. 

In quantum tomography, the usual scenario discussed has been full quantum process estimation - there really is 
a "black box" that the experimenter knows nothing about. While conservative, this is highly unlikely in practice 
because physics typically gives some insight into the form of the Hamiltonian which gives rise to the process. This 
additional knowledge reduces the number of parameters of the process and processes for learning these parameters 
are sometimes called partial process estimation or partial tomography [51410]. If our goal is to build a quantum 
information processing device, we must consider also an additional complication: a characterization of a process at a 
"snap-shot" in time is not nearly as useful as a characterization of the dynamics a quantum system undergoes. The 
evolution of closed systems is given by a Hamiltonian operator, and hence this process is usually called Hamiltonian 
estimation in such cases. 

One way to adapt the above schemes to Hamiltonian estimation is by stroboscopically estimating snap-shots of the 
process at fixed times and then use various algorithms to invert these to find the Hamiltonian via post-processing 1 . 
The key additional freedom we consider has received little attention to date, and is that the controls after some number 
of measurements can depend on the outcomes of those previous measurements. Hence, we call our derived strategies 
adaptive or online. Under its very broad definition, our method can be called machine learning; however, a more 
descriptive name is sequential Monte Carlo Bayesian experimental design. The marriage of sequential Monte Carlo 
methods |12) and Bayesian experimental design methods [13] has been considered very recently in a wide variety 
of classical contexts [14 -18 and also for measurement adaptive quantum state tomography [T9 2 . Other machine 
learning ideas have also been generalized to the quantum domain [21H27) . In this paper we present such an algorithm 
for learning dynamical parameters of quantum mechanical systems. 



* Corresponding author (cgranade@cgranade.com). 

1 See 1 1 1 1 and references therein. 

2 It is interesting to note, however, that online experimental design may be unnecessary (asymptotically requiring only one adaptive step) 
for the particular state tomography problem considered there [20] , 
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We make this learning process tractable by utilizing information about a system, rather than starting from worst- 
case assumptions such as those made in traditional quantum process and state tomography. We often have in practice 
knowledge about the dynamical model that describes a system of interest, and wish to improve that knowledge 
by estimating specific model parameters. Thus, practical Hamiltonian finding can often be achieved via a suitable 
parameterization of the Hamiltonian, H(x\, . . . ,Xd), reducing the problem to estimating the vector of parameters 
x = (x\, . . . , Xd)- The task we consider is the design of experiments for the purpose of deducing these parameters in 
the smallest number of experiments possible. Our algorithm also provides a region estimation for the Hamiltonian 
parameters that encloses some fixed volume of parameter space in which the mean or the variance of the Hamiltonian 
parameters are expected to be found with high-probability. We also generalize this concept to allow the algorithm to 
learn hyperparameters, which describe the distribution of the Hamiltonian parameters in cases where the parameters 
randomly vary between experiments. 

This paper is organized as follows. In section III] we review the formalism of Bayesian experimental design. Next, 



we discuss the statistical metrics we employ and the Cramer-Rao bound in section III Section IV introduces the 
sequential Monte Carlo algorithm. The test cases we use for the numerical experiments are described in section [VlT] 
In sections |V| and |VT| we discuss the application of our algorithm to region estimation and hyperparameter estimation, 



respectively. We explore the implications of the numerical benchmarking results in section |VIII| 

II. EXPERIMENTAL DESIGN FORMALISM 

The key element which interfaces quantum theory and machine learning is the equivalence of the Born rule from 
quantum theory and the likelihood function from statistics |28j . Each quantum mechanical problem specification 
produces a probability distribution Pr(dk\x; Ck), where dk is the data obtained and Ck are the experimental designs 
(or controls) chosen for measurement k, and where a; is a vector parameterizing the system of interest. 

Suppose we have performed experiments with control settings C :— {cj., C2, . . . , c^} and obtained data D := 
{di, d 2 , . . . , djy}. The model specifies the likelihood function 

N 

Pr(D\x;C) = J] Pr(d fc |x; c k ). 

fe=i 

However, we are ultimately interested in Pr(x\D;C), the probability distribution of the model parameters x given 
the experimental data. We achieve this using use Bayes' rule: 

Pr(3;|AG) - P^D\C) ' 

where Pr(a;) is the prior, which encodes any a priori knowledge of the model parameters. The final term Pr (D\C) 
can simply be thought as a normalization factor. We refer to this update proceedure as batch processing because the 
experimental controls C do not depend on the observed data and hence the processing for the experiment design can 
be done offline (meaning after all data is collected). 

An alternative to batch processing of given data is to adaptively choose the controls for the next experiment given 
the data from the past experiment. This idea can be formalized in various ways - the most natural for our purposes 
being called Bayesian experimental design [13) . For this we conceive of possible future data d^+i obtained from a, 
possibly different, set of experimental controls cjv+i- The probability of obtaining this data can be computed from 
the distributions at hand via marginalizing over model parameters 

Pr(d N+ i\D;c N+ i,C) = J Pr(d N+ i\x; cjy+i) Pr(x\D; C)dx. 

Note, in the remainder we will use the following abbreviated notation for expectation values: 

Pr(d N+1 \D;c N+1 ,C) = E xlD . c [Pr(d N+1 \x; c N+1 )}, (1) 

where the subscript on E denotes the variable for the expectation to be taken over. 

The expectation value in can be used to inform the algorithm about the choices of experimental parameters 
that are more useful than others. This usefulness is quantified, for a given choice of a utility function U(D;C), by 
the expected utility of an experiment 

U(c N+ i) = E dN+1 \ D . CN+1 ^ c [U(d N+ i;cN+i)]i 

where U(dN+i', cn+i) is the utility we would derive if experiment cn+i yielded result d^+i- The choice of the utility 
function is motivated by the figure of merit that we want to optimize. We will consider two canonical choices: 
information gain and the negative variance and discuss them in detail in the subsequent section. 
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FIG. 1: Overview of algorithm the combined experimental design and parameter estimation algorithm, showing the posterior 
distribution as a feedback into the next iteration. Blocks with a white background indicate model-dependent steps. 



III. UTILITY FUNCTIONS AND THE CRAMER- RAO LOWER BOUND 



Given a set of observed outcomes, the choice of subsequent experimental parameters that informs us most about 
the model parameters is given by the utility function. A generally well motivated measure of utility for scientific 
inference is information gain [311 132) . In information theory, information is measured by the entropy 

U(d N+1 ;c N+1 ) = E xldN+uD . CN+uC [\ogPi(x\d N+1 ,D; c N+1 ,C)]. 

Maximizing the expected value of this utility function is equivalent to minimizing the expected entropy in the posterior 
distribution, Pr(cc|djv+i, D; cjy+i, C). We also test or method with a utility function that minimizes the expected 
variance in Pt(x\(1n+i, D; Cjv+i, C). We show that this choice is optimal for minimizing the the mean squared error 
of the protocol. 

An estimator is a function x that takes a set of observed data D collected from a set of experiments with controls 
C and produces an estimate for the unknown parameters x. Here, we evaluate the quality of an estimator x by using 
a generalization of the squared error loss called the quadratic loss as our figure of merit. The quadratic loss is defined 
for a vector of parameters x, data D and experiment designs C, as 

L Q (x, x(D, C)) = (x- x(D, C)) T Q(x- x(D, C)) , (2) 

where Q is a positive definite matrix on the space of unknown parameters that defines the relative scale between the 
various parameters of interest. The quadratic loss function is useful to us in that it is computationally inexpensive 
to calculate and may be analyzed by well-known statistical techniques. In particular, the Cramer-Rao bound can be 
used to lower-bound the mean quadratic loss incurred by an estimator, under the hypothesis of a given true model x 

m- 

Following a decision theoretic methodology pH] , the risk of an estimator given a set of experiment designs C is its 
expected performance over all possible outcomes D with respect to the loss function: 

R(x,x; C)=E D]x . c [L(x.x(D- 1 C))}. 
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The Bayes risk is the average of this quantity with respect to a prior distribution on x (denoted tt) and is given 
explicitly by 

r(7r;C) = E x [R(x,x; C)] 

— J ir(x)R(x 7 x; C)dx. 

where x is assumed to be a Bayes estimator, which means it is the one which minimizes the Bayes risk. When the 
loss function is taken to be squared error (in the single parameter case) or the quadratic loss (in the multi-parameter 
case), the Bayes risk is more familiarly known as mean squared error (MSE). 

For quadratic loss (and many others |35| ) the unique Bayes estimator is the mean of the posterior distribution 

x(D;C)=E a:lD . c [x}. 

Minimizing the Bayes risk of a choice of parameters is equivalent to maximizing the negative Bayes risk for that set; 
therefore, it is reasonable to choose the negative Bayes risk as our utility function. It also has theoretical benefits in 
that it is easy to compare the performance of algorithms that take E/(c/v+i) = — r ( 7r J c/v+i, C). 

The question of how well can we estimator x becomes the question of how low can we make the Bayes risk r(jr; C). 
We lower bound the achievable risk via the Bayesian variant of the Cramer- Rao bound [35] . Both require finding the 
Fisher information. In the case of multiple parameters, the Fisher information is a matrix defined by 

I(x; C) = E D , x;C [V« log (Pr(D|x; C)) ■ log (Pr(D\x; C)) 

The Fisher information does not depend at all on the prior distribution, and thus is calculated in the same way 
regardless of how many experiments have already been performed. 

The standard Cramer-Rao bound is then given by Cov(£) > I(x;C)~ 1 , where X > Y means that X — Y is 
positive semi-definite. If we choose the matrix Q associated with the quadratic loss to be Q = 1, then R(x,x; C) = 
Tr(Cov(i)) > Tr(7(a;; C) _1 ). Clearly, this statement of the multivariate Cramer-Rao bound assumes that I is non- 
singular. Singular Fisher information matrices arise when there are experiments that provide no information about 
at least one of the experimental parameters. Unfortunately, that assumption is not met in general. We avoid this 
problem by considering the Bayesian information matrix J(tt;C) — K x [I(x; C)]. Then, the Bayesian Cramer-Rao 
bound (BCRB) is given by [36 

r(ir;C) > J{tt;C)- x . 

Lower bounds can be found for specific values of C using numerical integration. Here we apply an iterative algorithm 3 
to sequentially monitor the lower bound. We will subscript the various quantities of interest by N, which means "at 
the N th measurement" . Then, the Bayesian Cramer- Rao bound is given by 

r(ir;c N+ i) > JAr + i(cAr +1 ) _1 , (3) 

where the iteration is given by 

J n+i(cn+i) = J(tt;c n+ i) + J n (cn) 
and the initial condition is Jq = E X [V X log (7r(a;)) • log (7r(a;))]. 



IV. SEQUENTIAL MONTE CARLO ALGORITHM 



Monte Carlo methods remedy problems that deterministic numerical integrators encounter in finding the volume of 
multi-dimensional spaces. Specifically, the errors incurred by Monte Carlo integrators do not depend on the dimension 
of the space. The variant of Monte Carlo integration that we use dates back to 1993 [38] but has been rediscovered 
many times in a wide variety of scientific applications under the following names 4 : sequential Monte Carlo, particle 



3 This type of algorithm has been given some attention recently for "state space models" and classical signals with additive noise |37l . 

4 See, for example, [12] for a recent tutorial on these methods. 



5 



filters, sequential importance sampling, Bayes niters, and so on. We will use the term sequential Monte Carlo (SMC) 
and will now sketch the idea behind the algorithm. 
Recall the Bayes update rule for one datum d\, 

Pr(a;|di) oc Pr(di|x)Pr(x). 

The distribution can be processed sequentially as more data arrive: 

Pr(as|d 2 ,di) oc Pr(d 2 |cc) Pr(x\dx), 
PT(x\d 3 ,d 2 ,di) oc Pr(d 3 |a;) Pr(a;|d 2 , di), 



These updates are difficult to perform because the evaluation of the posterior distributions require evaluations of the 
costly multi-dimensional integrals over the parameter space. We address this issue by using sequential Monte Carlo 
methods, which approximate a distribution over parameters with a distribution that has support only over a finite 
number of points (often referred to as particles). The support of the distribution over these points is called the weight 
of the particle, and by convention the sum of all weights must be 1. More concretely, we approximate an arbitrary 
distribution by 

n 

Pr(x\D) ^J2 w k(D)S(x - x k ), 

k=l 

where the weights at each step are iteratively calculated from the previous step via 

n 

w k {d j+ i UD)= }^ Pr(dj+i \xk)w k (dj ) . 

k=l 

Using this form of a SMC algorithm also has the advantage of ensuring that the prior and posterior distributions for 
the update always have support over a finite number of particles, which simplifies the analysis of our algorithm. 

The particle approximation can be made arbitrarily accurate by increasing the number of particles and will be a 
good approximation at every update provided we feed in, at the initial stage, the appropriate weights {uik} and support 
points {a^fc}. Since both the weights and support points of the particles carry information about distributions over 
the model parameters x, we can without loss of generality choose the initial weights to be uniform, Wk — l/n for all k, 
and the initial support points to be samples from the correct prior Pr(x). Having made the particle approximation, 
we can compute expectation values and variances according to Algorithms [T] and [2j and can perform Bayes updates 
according to Algorithm [3j 

Algorithm 1 Estimators for mean and covariance using Sequential Monte Carlo. 

Input: Particle weights Wi, i £ {1, . . . , n}. 

Input: Particle locations x», £ E {1, . . . , n}. 

Output: Approximations /i and £ of E[se] and Cov(a;). 

function MEAN({Wj}, {a^}) 
return fl<^J2i WiXi 

end function 

function Cov({uii}, {a;^}) > Estimates the covariance matrix Cov(a;) = E[cca; T ] — E[cc]E[a;] T . 

return S 
end function 



Sequential Monte Carlo techniques require careful effort to avoid introducing errors due to limited numerical preci- 
sion. The first problem any SMC algorithm runs into is zero weights. This is doubly painful since we are effectively 
operating with fewer particles but using the same amount of computational resources. Since the support of our ap- 
proximate distribution is a measure-zero set according to the correct distribution, all the weights will eventually be 
zero; we cannot avoid this but it can be postponed by using resampling techniques. 

Generally, the idea behind resampling is to adaptively change the location of the particles to those which are 
most likely. The simplest of these types of algorithm chooses n particles (the original number), with replacement, 
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Algorithm 2 Estimator for arbitrary expectation values using SMC. 
Input: Particle weights Wi, i £ {1, . . . , n}. 
Input: Particle locations Xi, i £ {1, . . . , n}. 
Input: Function f(x) of model parameters to average over. 
Output: Approximation E[/(aj)]. 
function MEANFN({w;i}, {x^, f) 

return £\ w l f(x i ) 
end function 



Algorithm 3 Sequential Monte Carlo update algorithm. 

Input: Particle weights Wi, i £ {1, . . . , n}. 
Input: Particle locations Xi, i £ {1, . . . , n}. 

Input: New datum D, obtained from an experiment with control C. 
Output: Updated weights w'i- 

function UPDATE({^i}, {a;,:}, D, C) 
for i £ 1 — > n do 

Wi <s— Wi Pr(D\xi, C) > The updated weights Wi are unnormalized. 

end for 

Wj 4— Wjl £\ Wi > We must normalize the updated weights before returning, 

return {w'j} 
end function 



according to the distribution of weights then reset the weights of all particles to 1/n. Thus, zero weight particles 
are "moved" to higher weight locations. To determine when to resample, we shall compare the effective sample size 
n css = 1/X)i w i to a threshold resample_threshold, which is the effective ratio of the original number of particles 
n. We use resample_threshold= 0.5, as suggested by [3T)] . 

The resampling algorithm we use was first proposed in |39j and is given explicitly in Algorithm [4] The idea behind 
the algorithm conforms to the intuition given above but it incorporates randomness to search larger volumes of the 
parameter space. This randomness is inserted in the resampling algorithm by applying a random perturbation to 
the location of each particle that is introduced during the resampling process. Thus, the new particles are randomly 
spread around the previous locations of the old. More formally, we model this by randomly choosing a particle location 
Xi, then perturbing it by a normally distributed vector e ~ A/"(0, £) (we will come back to how to choose the mean 
and covariance). The new particles are thus samples of the convolved distribution 

P( x> ) = J2 Wl yp^^ ° Xp ~ Mi)^ -1 ^ ~ ' ( 4 ) 

where k is the number of model parameters. A distribution of this form is known as a mixture distribution, and can 
be efficiently sampled by first choosing a particle, then choosing a perturbation vector. 

To choose the mean /x i of each term in the resampling mixture distribution, we choose a vector that is a convex 
combination of the original particle location Xi and the expected model fj, = E[a;], so that 

Hi = aXi + (1 - a)n, 

where a is a tunable parameter of the resampling algorithm. We will use a = 0.98, as suggested by [39]. The covariance 
of each perturbation is then given by 

S = (1- a 2 )Cov[£c]. 

Our resampling algorithm then involves drawing n new particles from the distribution given by Q and setting the 
weight of each new particle to 

There are a few details to address regarding the efficiency of the SMC algorithm. The first thing to note is that, 
since the choice of resampling algorithm is usually tailored to the problem at hand, it is hard to say something in 
general about the algorithmic complexity of it. A more pressing issue for us, however, is that quantum simulation 
is required in order to evaluate the likelihood function. This step will not generally be efficient because no known 
classical algorithm exists that can simulate generic quantum dynamics in time polynomial in the number of interacting 
subsystems. 

Thus, since calls to the likelihood function are expensive, we wish to minimize the number of times it is called. 
To achieve this, we use an approximation that involves using only the highest weighted particles to compute the 
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Algorithm 4 Sequential Monte Carlo resampling algorithm. 



Input: Particle weights u)i, i £ {1, . . . , n}. 
Input: Particle locations Xi, i G {1, . . . , n}. 
Input: Resampling parameter a £ [0, 1]. 
Output: Updated weights w[ and locations x\. 
function RESAMPLE({u>i}, {xi}, a) 

fi <- MEAN({ ro,}, {a;,}) 

h <- VI - a 2 

Y,<-h 2 Cov({«Ji},{xi}) 

for i 6 1 — > n do 

draw j with probability Wj > Choose a particle j to perturb. 

fl i axj + (1 — a)(j, t> Find the mean for the new particle location, 

draw x'i from Af(m, S) > Draw a perturbed particle location. 

Wi 1/n > Reset the weights to uniform, 

end for 

return {wi}, {a;-} 
end function 



expectation values appearing in the utility function. Note that this will not reduce the accuracy of the estimation 
of parameters given experiments - rather, it reduces the accuracy with which we choose optimal experiments. This 
is the ideal place to make the approximation since the optimization routine makes the most calls to the likelihood 
function and no experiment is "bad" in the sense that the expected risk (and hence the actual risk, on average) can 
increase. This idea is formalized in the algorithm by setting a parameter approx_ratio which is the percentage of 
particles to use for the approximation. We give pseudocode for this algorithm in Algorithm [6j 

Algorithm 5 Approximate utility functions using Sequential Monte Carlo. 

Input: Particle weights Wi, i £ {1, . . . , n}. 
Input: Particle locations Xi, i (E {1, . . . , n}. 
Input: Control description C. 
Input: Positive semi-definite scaling matrix Q 
Output: Utility (7(C). 

function UtilNV({w;}, {xt}, C, Q) 

for D e 1 -> ^outcomes 

do 

{to-} 4- Update({w»}, {x^, D, C) 

fl 4- MEAN({^}, {x x }) 
U D < J2 t Wi(xi - fl) T Q(Xi - (i) 

u D <-ud -Yli w i Pr(-Dja;i, C) 
end for 
return ~}2 D ud 
end function 

function UTlLlG({wi}, {a;.;}, C) > Calculates the information gain utility function, 

for D e 1 — >• n ou tcomc S do 
for i G 1 — > n do 

Pn\ Xi «- Pr(D\xi, C) 
end for 

PD <~ Y.i W iVD\v:i 

end for 

return / <s— Hd{pd) — 5Zi WiHoipD^j) > Ud is the entropy over data D. 

end function 



> Calculates the negative variance utility function. 

> Find the hypothetical weights, had we obtained the datum D. 

> Calculate the mean using the updated weights. 
t> Reduce the variance to a scalar by using the scaling matrix Q. 
> Weight the partial utility by the marginalized likelihood Pr(D|C). 



We combine these prior algorithms to obtain Algorithm [7J which is our complete algorithm for adaptively designing 
experiments using the SMC approximation. Note that we have left unspecified here the choice of local optimizer; in 
practice, this will be chosen depending on what works for a given experimental model. In Section VIII we compare 
the Newton conjugate-gradient (NCG) and nonlinear conjugate-gradient methods to a "null" optimizer that only 



evaluates the utility function at the initial guesses. Also in Section VIII we consider which choices of n, n 
approx_ratio result in a useful estimation algorithm. 



guesses 



and 
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Algorithm 6 Reduced particle approximation for Sequential Monte Carlo utility functions. 
Input: Particle weights u>i, i G {1, . . . , n}. 
Input: Particle locations Xi, i € {1, . . . , n}. 

Input: Ratio approx_ratio of the particles to keep in the reduced approximation. 
Output: Reduced sets of particle weights {ibi} and locations {xt}. 
function REAPPROx({tUi}, {xi}, approx_ratio) 
n ^— [n ■ approx_ratioJ 

draw 7r uniformly at random from Sym(n) t> Sym(n) is the symmetric group acting on n elements. 

{wi} <s— > We permute the elements to avoid introducing patterns when sorting the particle weights. 

{Xi\ <- {x^ {i) } 

{sk} <— Sort({wj}) > Get a list of indices s» such that w Si > w Sj for all i, j. 

return {wi} <— {w Si : i G 1 — > n}, {Ai} <— {x Si : i G 1 — ► n} 
end function 



Algorithm 7 Complete adaptive Bayesian experiment design algorithm, using sequential Monte Carlo 

approximations. 

Input: A number of particles n to be used. 
Input: A prior distribution it over models. 
Input: A number of experiments N to perform. 
Input: A resampling parameter a G [0, 1]. 

Input: A threshold resample_threshold G [0, 1] specifying how often to resample. 

Input: An approximation ratio approx_ratio. 

Input: An local optimization algorithm LocalOptimize. 

Input: A particular choice of utility function Util. 

Input: A heuristic GuessExperiment for choosing experiment controls, and a number n gue sses of potential experiments to 

consider in each iteration. 
Output: An estimate x of the true model xq. 

function ESTIMATE ADAPTIVE(n, ir, N, a, resample_threshold, approx_ratio, OPTIMIZE, UTIL, n guCBBCB , GUESSEXPERI- 
ment) 

Wi 4 — \ j n > Start by initializing the SMC variables, 

draw each Xi independently from it 

for t cxp G 1 — > N do > We now iterate through each experiment, 

if approx_ratio 7^ 1 then > If we are using a reduced particle set, populate that first. 

{-Wi}, {Xi} <S— ReAPPROx({w;}, {x^, approx_ratio) 
else 

{Wi}, {x t } i- {ill,}, {Xt} 

end if 

for iguoss G 1 — > "guossos do > Heuristicly choose potential experiments, and optimize each independently. 

Ci guess GUESSExPERIMENT(i cxp ) 

Ci tneee , ?7 iguess 4r- LOCALOPTIMIZE(UTIL, C iguoaB , {Wi}, {Xi}) 

end for 

ibost ^— argmax; uosb C/i guoaB > We pick the experiment whose post-optimization utility is highest. 

C ±- C, , 

Di cxp the result of performing C > The best experiment is then performed. 

{n)i}, {xi} <— UPDATE({uii}, {xi}, D, C) > This update carries the posterior distribution forward. 

if wf < N ■ resample_threshold then > Resample if the effective sample size n css < resample_threshold. 

{Wi}, {Xi} RESAMPLE({wi}, {Xi}, a) 
end if 
end for 



return x <— MEAN({wi}, {a;;}) 
end function 



> After all experiments have been performed, return the mean as an estimate. 
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V. REGION ESTIMATION 



In addition to providing an accurate estimate of the true model parameters for the system, it is important to be 
able to quantify the uncertainty in the estimated model parameters. This task can be achieved by finding a region 
X of the space of models such that Pr(a;o G X) is maximized and such that Vol(A) is minimized. This is useful, for 
instance, if we consider the region estimate X as input to an optimal control theory (OCT) algorithm that describes 
the range of dynamics experienced by a quantum system. Since the OCT algorithm must find a control design that 
is robust for every point in that range, the cost of that optimization increases with the volume of X. 

Because we are interested in the probability Pr(cc 6 X) of the true model x lying within our region estimate X 
for a given run of the SMC algorithm, we say that J is a credible region [JDIST]. This is in contrast to confidence 
regions which are functions from data records D to regions X con f(D) such that the probability of obtaining a data 
record for which Xo £ X con [(D) is at least some threshold That is, credible regions give probabilities about 

a single data record, while confidence regions give probabilities about all possible data. Broadly speaking, credible 
regions are Bayesian analogues to the frequentist concept of confidence regions. Unless otherwise noted, we concern 
ourselves here with credible regions as obtained from posterior distributions. The two kinds of region estimates are 
closely related, as has recently been explored in the context of quantum tomography [441 145) . 

We make the problem of finding a credible region estimate amenable to analysis by SMC by turning the problem 
into that of estimating an expectation value. In particular, the probability of the true model being within a region 
can be expressed as 

Px(x eX)=E[l jt ], 
where l-o- is the indicator function for X, defined by 



l x e x 
x 4 X 



The expectation value E[l^] can then be computed using Algorithm [2j giving that 



J2 w * 



Thus, by construction, any region containing particles of total weight at least r will have an approximate probability 
mass of at least r. We formalize this intuition by introducing a probability mass function m(R) on regions R such 
that 



m(R) =E[1 R ]. 

Similarly, let rh(R) = J2i ieR Wi ^ e an approximation of m{R) using the SMC algorithm. 

We thus seek a region X such that Vol(A) is small, m(X) is large and such that X is an efficiently computable 
property of the current SMC state. We achieve the latter two properties by choosing some appropriate geometric 
function of a set of particles X r whose weight is above some threshold weight r; for example, the convex hull or the 
minimum- volume enclosing ellipse of X r both satisfy m(X r ) > r and may be computed using well-known classical 
algorithms [151 117] . 

We improve on these results by supposing that, after collecting a reasonable amount of data, the posterior distribu- 
tion is approximately normally distributed according to Af([i, X) for some /x and X. This assumption holds when the 
Fisher information is non-singular, and we find in the case of our benchmarks that it approximately holds if Pr(a;|_D) 
is sharply peaked. Under this assumption, it follows from the definition of the multivariate normal distribution that 
the inverse covariance matrix X -1 describes an ellipse such that approximately 0.682 d of the probability mass is 
contained inside the ellipse, where d is the number of unknown model parameters, d = dim a;. In particular, the co- 
variance matrix transforms a vector z of length d with each component drawn from the standard normal distribution 
A/"(0, 1) into a random variate of a multivariate normal distribution with the given covariance, and so inverting that 
transformation gives the z-score for each component. 

More generally, under the assumption of a normally distributed posterior, the error ellipse of points x satisfying 



(x- fif^-^x- n) < Z 2 



(5) 
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for some Z > will contain a ratio 

(cdfv(2) - cd^(-Z)) d = erf 

of the particle weight, where cdfjs(Z) is the cumulative distribution function for the normal distribution, evaluated 
at Z. Thus, if the assumption of a normal posterior is a good approximation, then the estimated covariance matrix 
according to Algorithm [TJ can be used as a region estimator. 

The volume of the covariance ellipse region estimator can then be found by again treating S as a transformation 
of a d-dimensional coordinate system, so that 

7T d / 2 

K ' r(f + i) v ; 

We test that the normal posterior assumption is approximately met by finding the approximate probability mass 
to(Cov(5;) _1 /Z 2 ) for a given Z, and comparing to the expected cdf. Moreover, we compare the optimal size of the 
covariance ellipse to the actual size by appealing to the Bayesian Cramer-Rao bound, since E„.[Cov(;c)] > J(n- C) _1 . 
This comparison will be explored further in Section |VIII[ and will be used as a performance metric for our algorithm. 




VI. HYPERPARAMETER ESTIMATION 



Now that we have discussed the general framework for using Bayesian inference to learn Hamiltonian parameters, we 
will proceed to discuss an important generalization of the prior work. The generalization that we consider addresses the 
fact that quantum systems seldom have consistent Hamiltonians from experiment to experiment, due to experimental 
errors. Hyperparameters allow us to generalize the Hamiltonian learning problem from one involving learning the 
Hamiltonian parameters to one that involves learning the parameters that describe the distribution of Hamiltonian 
parameters. In this way, the method of hyperparameter estimation is an alternative to approaches such as quantum 
smoothing [29| 130] . which integrate over the history of a time- varying random parameter, rather than considering the 
distribution from which each realization is being drawn. 

We denote the hyperparameters for a model Hamiltonian as y to avoid subtle conceptual differences between the 
hyperparameters and the distributions on x that they describe. The probability distribution for x can then be written 
as Pr(x|y). Despite interpretational differences, the hyperparameters can also be learned using Algorithm [7] in exactly 
the same way that x is learned. The region estimates yielded by the algorithm are region estimations for y and, as 
we will show shortly, can easily be converted into region estimates for x. 

The drawback to this approach is that computations of the likelihood function can become much more expensive. 
Specifically, in order to compute the likelihood function Pr(D\y) we need to compute the probabilities of data d 
emerging from an ensemble of randomly sampled experimental parameters taken from the distribution described by 
y. A large number of samples, N a , may be required in some cases since the sample error scales as l/v N s . On the 
other hand, the approach is straight forward to implement because it does not require either increases to the number 
of particles or changes to the underlying algorithm. 

In some important special cases, this drawback can be avoided by analytically performing the marginalization over 

x, 

Pr(D\y) = J dxPr(D\x)Pr(x\y). 

In Section [VII C| we discuss a particular case where the marginalization is analytically tractable. 

The resulting means and covariance matrices for y can be readily converted to the corresponding quantities for x 
by using the chain rule for expectation values, 

E x Jx}=E y [E xly [x}}. (6) 

This expectation value can be computed using the posterior distribution Pr(y\D) and the intermediate model distri- 
bution Pi(x\y), which will typically be easy to compute from the definition of the hyperparameters. The covariance 
matrix for x is slightly more complicated. It is straightforward to verify that 



Cov x , y (x) = E y [CoVa,| y (x)] + Cov y (Ea,| y [a;]) 



(7) 
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For the special case that a; is a single parameter, the covariance can be replaced with the variance to obtain that 

Va,r Xty {x) = E y [Var,e| y (a;)] + Var y (E^^x]) . (8) 

Using the region estimate given by ^ to estimate a hyperparameter region translates to a region estimator for the 
model parameters x, if the distribution over hyperparameters y is approximately Gaussian near its peak. In the limit 



of many experiments, we find that this is a good assumption, as is discussed in Section VIII 

An important consequence of this derivation is that the same SMC algorithm can be used to estimate regions of 
model parameters, even when those model parameters vary with each experiment. In particular, as the covariance 
Cov y (x) in our estimate x of the mean model parameter vector decreases, our estimate of the model region approaches 
the "true" model region given by Cov x \y(x\y). 



VII. TEST CASES 



We assess the performance of Algorithm [7] through a number of test cases that are designed to examine the 
performance of the algorithm in a number of different relevant settings. Here we describe the test cases, which we 
build up in complexity starting from the simple model studied in detail in references [48H50| . The first case that 
we consider is learning a single parameter for an experiment in the presence of a known decoherence time. The 
second case generalizes this by allowing Ti to be unknown. This problem is particularly important because existing 
experiments require substantial processing to learn both the unknown parameter and the decoherence time. Finally, 
we consider a two-hyperparameter model with a single-parameter Hamiltonian and perform region estimation of both 
the hyperparameters and the parameters that are distributed according to them. 



A. Single Parameter Hamiltonian 



We first consider the example with one unknown and one control parameter. Suppose that a qubit evolves under 
an internal Hamiltonian 

H{u) = |oV 

Here u is an unknown parameter whose value we want to estimate. An experiment consists of preparing a single 
known input state ip- m = |+), the +1 eigenstate of cr Xl evolving under the Hamiltonian H for a controllable time t and 
performing a measurement in the cr x basis. The k th measurement has two outcomes which we record as dk € {0, 1}. 
The design specification for each experiment is given by the time tk that the state is allowed to evolved for under 
the unknown Hamiltonian in this case. Protocols that are provably optimal have been obtained by finding analytic 
expressions, and lower bounds, on the accuracy of generic protocols [50] . 

We will slightly generalize this model by allowing noise sources which lead to a decay in the information extractable 
from any measurement. This can manifest from, for example, a T 2 dephasing process which leads to the following 
likelihood function: 

*_ 

Pr(0|^;<) = cos 2 (|t) + 1 - ? — , (9) 
Pr(l|w;f) = 1-Pr(0|w;i), 

where uj is the unknown parameter to be estimated, t is the controllable parameter and T 2 is a known constant. 

This model was studied in references [15H5D] . There the model was able to be treated analytically. For the case 
with no noise (T 2 =00), given a normal prior with variance <r 2 , the risk scales exponentially as r ~ <r 2 (l — e -1 )^, 
whereas for finite T 2 , the scaling is exponentially suppressed when the measurement times reach t = T 2 . 



B. Two Parameter Model with Single Control 



Here, we are going to consider the same model from the previous section, 

±_ 

Pr(0Kr 2 ;t) =e'^cos 2 + — , (10) 

Pr(l|w, T 3 ;t) = l-Pr(0|w,T 3 ;t), 
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but where now both ui and Ti are unknown. The time t remains the only experimentally controllable parameter. For 
numerical convenience, we choose to parameterize this model as x — (ui,^ 1 ), so that each unknown parameter has 
the same dimensions. 

Even for such a simple generalization as this, the methods discussed above are not adequate for this more general 
problem. In particular, the Fisher matrix of any one measurement is singular and hence the standard Cramer-Rao 
bound does not hold - nor is it possible to utilize standard asymptotic approximations to normal distributions. 
Although we cannot find an analytic expression for the Bayesian Cramer-Rao bound in the same way we did for the 
single parameter problem, we use our SMC algorithm to efficiently numerically estimate it. 

In general, there is no reason to expect that us and T^ 1 will be expressed in such a way as to admit the same 
scale, and so in estimating using this model, we must set the semidefinite matrix Q for the loss function ([2]). This is 
discussed further in Section IVIIIBl 



C. One Parameter Model with Hyperparameters 



We also derive a two-parameter model similar to ( 10 1 by again considering a single qubit undergoing Larmor pre- 
cession as in {[oj) , but where the "true" precession frequency ui is itself distributed according to a Gaussian distribution 



of mean /i and variance a . In this case, following the discussion of Section VI the probability of data conditioned 



on the hyperparameters y = (/i, a) can be found by marginalizing over the intermediate random variable ui, so that 

Pr(d|/x,cr;t) = /pr(d|w) Pr(w|/i, a)du. (11) 
For the specific example of the Gaussian distribution, 

Pr(0|^, cr;t) = [ cos 2 ( ^ J e~ i ^^duj (12) 



aV2n 
1 
2 



1 + e- la ' cos(2/xt) . (13) 



At this point, we have entirely removed u> from the problem, leaving a two-parameter model, where we wish to estimate 
the mean and variance of an unknown normal distribution. 

As another example, instead of marginalizing against a Gaussian distribution, we consider the case that the inter- 
mediate model parameter ui is drawn from a Lorentz distribution. A Lorentz distribution is completely determined by 
its location and scale parameters ujq and 7, respectively, and so we use these hyperparameters to derive a new model, 

Pr(0|w„,7;*) = /cosVi/2) 1 ^ (14) 

J ^ 7 ((™)i + 1 ) 

= - (1 + e -*T cos (tuj )) . (15) 

Note that if we identify 7 = 2T T 1 , then the Lorentz hyperparameter model is the identical to that of Equation 



(10 1. This illustrates the relationship between decoherence processes and the lack of knowledge formalized by a 



hyperparameter model. In a similar fashion, ( 13 1 is also model of decoherence. Due to the t 2 dependence of the 
Gaussian-hyperparameter model, ( 13 1 represents a decoherence process that cannot be written in Lindblad form |51j 
because it cannot be drawn from a quantum dynamical semigroup. 

The approach of introducing hyperparameters allows us to estimate unknown distributions using the same SMC 
algorithm[7]that we use for estimating other model parameters, by making assumptions about the form of an unknown 
distribution. Using techniques developed in Section |VI| we can also extend region estimation to hyperparameter 
models to obtain regions on the intermediate model (in this case, ui) using regions on y. We discuss the resutls of 
these application in detail in Section |VIII B| 



VIII. RESULTS AND DISCUSSION 



We will now turn our attention towards assessing the performance of Algorithm [7] in practical examples |52| . Our 
results show that our adaptive Bayesian algorithm is able to learn model parameters using a very small number of 
experiments, including adverse situations where hyperparameters are needed to describe the fluctuations of model 
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N = 1 N = 6 N = 11 




FIG. 2: Left to right: the likelihood function for iV = 1, 6, and 11 simulated measurements at random times in in (0, 5n). The 
model is that given in equation |9| with T2 = 1007T. The red dots (and red arrows) are the randomly chosen true parameter uj. 
The blue dots are the n = 100 sequential Monte Carlo "particles". 

Expected Loss (1 00 particles) Expected Loss (1 ,000 particles) Expected Loss (1 0,000 particles) 

E(L) E(L) E(L) 
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Unoptimized BCRB for Unopt. NCG Optimized BCRB for NCG 



FIG. 3: Left to right: the performance, as a function of the number of measurement N, of the sequential Monte Carlo 
algorithm for n = 100, 1 000, and 10 000 particles. The model is that of equation (JoJ) with T2 = 1007T (see also figure The 
dashed lines indicate data taken without local optimization, while the solid lines indicate trials in which initial guesses were 
optimized using the NCG method. For each data set, the corresponding thick line indicates the Bayesian Cramer-Rao bound. 
Errors in estimating the performance are indicated by red shaded regions around each curve. 



parameters between experiments or where the decoherence time of the system is unknown. This performance is espe- 
cially noteworthy in the case of an unknown decoherence time Ti- Though methods of learning unknown decoherence 
processess have been developed and are well- understood [5"3"rl5"5] , these methods require multiple iterations of quantum 
process tomography implemented using either ensemble measurement or a large amount of data obtained using strong 
measurement. In the case discussed here, we can adaptively use prior information to obtain accurate estimates of T2 
using significantly less measurements than methods currently employed in systems with strong measurement |32) . 

Figures [2| and |T2| build intuition for how algorithm [7] actually learns parameters by describing a trial run for each of 
the models in Sections VII A and VII B These illustrate the movement of the particles, through resampling, to regions 
of high likelihood. In particular, we note that Fig. [2] demonstrates that only 11 experiments are needed in order to 
achieve a tight approximately Gaussian posterior distribution over the model parameter ui for the known T 2 model 
described in Sec. |VII A| Figure [12] shows that performance of the algorithm for the unknown T2 model described in 
Sec. VII B We see in that case that only 200 experiments are needed to find a distribution that is centered around 



the true values of ui and T%. 



A. Results for Known T 2 Model 



Our first set of numerical experiments examines the performance of our algorithm for the case where the decoherence 
time is known with perfect precision. Specifically, we take T2 = 1007T, uj ~ jV(0.5, 0.01) and choose the experimental 
times to be spaced uniformly such that the fc th experiment occurs at time t% = 2fc7r/3 (chosen arbitrarily) and examine 
the mean-square error in ui averaged over a minimum of 1625 trials with randomly chosen uj. This data is presented 
in Fig. [3] The figure shows that the mean-square error for uj decreases as we vary the number of particles from 100 
to 10 000, and in particular gives a relative MSE that is less than 1% for N > 100. We also see that the data remain 
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No optimization 



NCG opt., approx_ratio 0.1 
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FIG. 4: This figure compares the mean-square error as a function of the number of experiments used for the known Ti 
model with T2 = 100, 5000 particles and an approximation ratio of 1 with guessed experimental times chosen randomly from 
an exponential distribution with mean T%. On the left, data is shown for 1 guess, while on the right, we show data for 30 
guesses. The shaded region in each plot indicates a 68% confidence interval for data collected using NCG optimization with 
approx_ratio = 1.0. 



close to the BCRB (which is a lower bound on the MSE) for either n = 1 000 or 10 000 particles if no optimization 
is used, whereas n = 10 000 is needed to approach the BCRB in the case where NCG is used. This not only shows 
that several thousand particles should be sufficient for our purposes, but also that the MSE yielded by our algorithm 
scales near-optimally if a sufficiently large number of particles are used in the SMC approximation. 

Figure [3] provides a more in depth analysis of the error scaling for the known T2 model. The previous data set 
contained only one experimental guess per experiment, whereas in general we could consider making many guesses 
for the optimal experiment and choosing the one with the highest utility U(t) = — ¥,d[L]. We assess the performance 
of our algorithm as a function of the number of guesses by introducing a new guess heuristic. Instead of choosing 
uniformly spaced guesses, we choose each guessed time randomly from an exponential distribution with mean T2. This 
guess heuristic also has the advantage of using very little intuition about the structure of the Hamiltonian that we are 
attempting to parameterize, which means that we expect the performance of this heuristic to be a better estimate of 
the worst-case performance of our algorithm. 

Unsurprisingly, we find that the MSE is reduced if we choose the best of 30 possible experiments rather than a single 
randomly chosen experiment. We also see that local optimization tends to improve the quality of the approximation 
if we pick the approximation ratio to be 1. Figure [4] also shows that smaller values of the approximation ratio can 
cause the mean-square error to saturate at relatively large values. In fact, taking approx_ratio = 0.1 was sufficient 
to cause the data taken for 30 guesses and NCG optimization to have a larger mean-square error in uj than the case 
with no optimization and 1 random guess. For this reason, we take the approx_ratio = 1 for most of the examples 
in this section. In Section VIII D[ we shall see an example where approx_ratio < 1 provides more benefit. 

An individual trial may have a mean-square error that differs significantly from the expected loss. The shaded 
regions in Fig. [3] give 68% confidence intervals for the actual loss for the case with NCG and approx_ratio = 1 (the 
confidence intervals for the other data sets were similar), and find the surprising result that the mean-square error is 
frequently outside the confidence interval in the cases where NCG optimization is not used. This suggests that the 
distribution of the utility of experiments can wildly vary and that the distribution is skewed because a significant 
fraction of the guesses provide virtually no information. NCG minimizes the chances of choosing an uninformative 
experiment and hence it forces the guesses towards the more informative experiments. We should also note that there 
is room for improvement here, since the MSE is closer to the upper limit of the confidence interval. Sophisticated 
optimization procedures may therefore be of use when searching for informative experiments given an uninformed 
guess heuristic (such as our random guess heuristic). 

These results show that poor guess heuristics can be mitigated using our algorithm by optimizing over more guesses 
and using local optimization of the experiments. We also note that the median square-error performance of our 
algorithm tends to be much better than the mean square-error because a small fraction of the trials randomly choose 
very bad guesses. We see that NCG optimization can be used to cause the mean-square error to approach the 
median-square error. We therefore find that estimates of the error (such as the posterior variance or region estimates 
of w) are needed in order to guarantee that a particular trial is not pathological. 
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FIG. 5: Sequential Monte Carlo approximated covariance probability mass m(Cov(a;) _1 /Z 2 ) for the known- T2 single-parameter 
model as compared to the probability mass m nornm i w 0.9973 expected for the normal distribution and as function of the number 
N of experiments performed, averaged over 20 119 trials using a guess heuristic that chooses the k th guess to occur at time (9/8) k , 
using 30 guesses, 1 000 particles and NCG optimization. The dashed line shows the probability mass for the corresponding 
normal distribution. 



B. Region Estimation 



One of the most substantial contributions of our algorithm is its ability to provide region estimates for the location 
of the true Hamiltonian, which allow us to quantify our uncertainty in the true model parameters. We compare 
the probability mass enclosed by the covariance region estimator described in Section [V| A simplifying assumption 
is made in our analysis: we assume that the posterior distribution is approximately Gaussian. Although difficult 
to justify theoretically, we have yet to find an example for the models considered here where the posterior does 
not appear Gaussian after a sufficiently large number of experiments. Under the Gaussian model of the posterior 
distribution, we expect the true model parameters to be within an ellipse described by the covariance matrix whose 
volume is then described by the Z-score used. For example, in the one-dimensional case approximately 95% of the 
probability mass is located within 2-standard deviations, which corresponds to Z = 2. We choose Z = 3 standard 
deviations from the mean for these examples which correspond to probability masses of fh{Cav{x)~ 1 / Z 2 ) ps 0.9973 
and m(Cov(i) _1 /Z 2 ) ss 0.9946 for the one- and two-parameter cases respectively. 

Figure [5] illustrates that the approximate probability mass m app roaches the probability mass we would expect for a 
normal distribution for the known- T 2 model (introduced in Section VII A) in the limit of large N, providing evidence 
in favor of our use of the covariance ellipse as a region estimator on the posterior. In particular, we note that the 
value of rh approaches 0.9973, such that the quality of the Gaussian approximation improves as we collect data. The 
transient behavior for small experiment numbers occurs because insufficient experiments have been considered for the 
posterior to approach a Gaussian. In this specific example, the average differences in enclosed probability mass after 
each experiment are on the order of 0.01%, and thus may not be of practical signifigance. 



C. Results for Unknown T 2 Model 



We now turn our attention to the comparably challenging task of learning Hamiltonian parameters without a 
precise estimate of T2. These calculations were performed using the true distributions oj ~ A/"(0. 5, 0.0025) and 
1/T 2 ~ A/"(0.001,0.00025 2 ), and with the scale matrix Q = diag(l, 0.0025/0.00025 2 ) = diag(l, 100). The guess 
heuristic that we focus on chooses times randomly from an exponential distribution with mean 1 000, corresponding 
to the mean value of T2 according to the initial prior. 

We examine the variation of the MSE with the number of guesses used in Fig. [6] The figure shows that, in the 
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FIG. 6: Benchmarking of the "unknown- T2" model (Section VII B 1 using n = 5 000 particles and random initial guesses without 
local optimization. Data indicated dashed lines correspond to trials where a single initial guess was used for each experiment, 
while data indicated by solid lines were collected using 30 guesses per experiment. The single-guess data is averaged over 1,380 
trials while the 30-guess data is averaged over 1,109 trials. Errors in estimating performance are indicated by red shaded regions 
about each curve. 



absence of local optimization of experiment times, the MSE for both uj and 1 jTi is significantly improved by using an 
increased number of guesses. In particular, we find that if 30 guesses are used, then only 50 experiments are required 
on average to learn uj within a 0.9% error, even without a well characterized Ti- The improvement is much more 
substantial for uj than it is for I/T2 because the contrast on T2 is much less significant. 

Fig. [7] examines the effect of increasing the number of guesses for strategies that use NCG. The most significant 
qualitative difference between the data collected using NCG and that of Fig. [6] is that the MSE for uj shows no 
evidence of saturating and instead continues to shrink as the number of experiments are increased (as seen most 
clearly in Fig. |8|. This implies that our randomized guess heuristic is unlikely to randomly guess very informative 
experiments after a fixed number of experiments, but the landscape is sufficiently devoid of local optima that NCG 
optimization finds informative experiments in the vicinity of our uninformed guesses. We also observe that NCG does 
not substantially improve the MSE if 1 guess is used. This suggests that the landscape is not sufficiently convex that 
local optimization about an individual guess is likely to find experiments that are substantially more informative. We 
therefore conclude, again, that increasing number of guesses used and using NCG substantially improves the MSE for 
uj and has a much more subtle effect on the knowledge of T2 if local optimization is used. 

Similarly to the case of known T 2 , it is useful to benchmark the performance of our algorithm against the BCRB, 
which gives a lower bound on the MSE. Figure [9] provides a comparison of the MSE, the estimate of the MSE given 
by the variance of the posterior and the BCRB for uj, T^ 1 and Tr(S ■ Q). We see that the expected posterior variance 
is typically within statistical error of the MSE for all three of these quantities, suggesting that the posterior variance 
can be used as a very good estimate of the MSE for this model. We also note that the MSE is very close to the 
MSE for the T 2 -1 data and Tr(S • Q). The MSE for uj is within a constant multiple of the BCRB. We do not, in 
fact, expect that the MSE in uj should approach the BCRB because the algorithm chooses experiments to optimize 
Tr(S • Q) rather than the error for either uj or T 2 X individually. 



D. Hyperparameter Region Estimation Performance 



Having demonstrated the effectiveness of our region estimation algorithm, it remains to show that the generalization 
to hyperparameter regions works as described in Section |VI| The objective here is to analyze the robustness of our 
algorithm in the presence of fluctuating "true" parameters of the Hamiltonian. We do so by using the Gaussian 
hyperparameter model of Equation ( 12 ) as discussed in Section VII C[ then comparing the model parameter region 
volume and probability mass for the region estimated from Equation ^ to the volume and probability mass of the 
corresponding "true" model parameter region. We benchmark this model by choosing "true" hyperparameters /j, and 
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FIG. 7: Benchmarking of the "unknown- T2" model (Section VII B \ using n = 5 000 particles and random initial guesses with 
local optimization by the NCG method. Data indicated dashed lines correspond to trials where a single initial guess was used 
for each experiment, while data indicated by solid lines were collected using 30 guesses per experiment. The single-guess data 
is averaged over 1,023 trials while the 30-guess data is averaged over 930 trials. Errors in estimating performance are indicated 
by red shaded regions about each curve. 
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FIG. 8: Benchmarking of the "unknown- T2" model (Section VII B I using n = 5 000 particles and 30 random initial guesses. 
Data indicated dashed lines correspond to trials where a each initial guess was used without local optimization, while data 
indicated by solid lines were collected using NCG optimization for each guess. The unoptimized data is averaged over 1,109 
trials while the optimized data is averaged over 930 trials. Errors in estimating performance are indicated by red shaded regions 
about each curve. 



a 2 for u! according to the normal distribution 



H,a 2 ~ Af [(// M ,/i cr 2),diag(cr^,(T^ 



(16) 



Recall that the unknown frequency is distributed as w ~ ./V(/i,<7 2 ). In particular, this true distribution does not 
admit any correlation between the mean and variance hyperparameters. We then use the true distribution as our 
prior distribution. 

Figure 10(a) provides estimates of the probability mass contained within our estimated region for the Hamiltonian, 
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FIG. 9: The actual and estimated performance, as a function of the number of measurements N, of the sequential Monte 
Carlo algorithm for n = 5 000 particles. The model is that of equation ( 10 1 with unknown T2 (which is estimated as V = I/T2 
for numerical precision considerations). The dotted curve is the posterior variance of the particles; dashed is the actual mean 
squared error and solid is numerically calculated Bayesian Cramer-Rao lower bound. In subfigures (a) and (b), the MSE and 
variances are those of the individual parameters uj and T^ 1 , respectively, while subfigure (c) shows the actual and estimated 
quadratic losses scaled using Q — diag(l, a^/crH,_i), where crj and o 1 \ are the variances in cj and T 2 _1 according to the initial 
prior 7r. 



which uses a Z-score of 3. Since we assume a Gaussian posterior, we anticipate that 99.7% of the probability mass 
should lie within the region estimation of E[lj] ± 3-y/Var(w). We find very good agreement with this assumption, 
and find that at worst 99.4% of the probability mass for the hyperparameters lies within the estimated region. The 
data also suggests that these small differences vanish for the optimized data sets, which appear to approach the ideal 
enclosed probability mass of 99.7% in the limit of large N. It is also interesting that the data with approx_ratio = 0.1 
yielded the best region estimation for the probability mass (unlike the previous examples). This is likely because the 
low approximation ratio de-emphasizes the tails of the posterior and non-Gaussian behavior typically is manifested 
in the tails. This shows that there are, perhaps surprisingly, examples where taking approx_ratio < 1 is useful. 

Hyperparameters are not typically a quantity of interest by themselves. They usually are of relevance because they 
parameterize a distribution of the unknown parameter. Following Equation (JsJ) , we calculate Var(w) as 

Var(w) = Var(/t) +E[<7 2 ]. 



Figure [l0(b) compares Var(u;) to the variance parameter a 2 . As the number of experiments grows, our region estimator 



for uj slightly overestimates the "true" variance of u (on average) . We see from Fig. |10(b)| that the estimatate of the 
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FIG. 10: Benchmarking region estimators for Gaussian hyperparameter model (Section VII C I using n — 2 000 particles, 
U) ~ N{p,, a 2 ) where fj, ~ N(0.5, 0.001 2 ) and a 2 ~ A/"(0.0025, 0.0025 2 ). 



variance of the unknown frequency that is inferred from our region estimate of the hyperparameters systematically 
over-estimates the variance of w tr uo on average. This bias vanishes as the number of experiments increases. We 
can therefore conclude that we can use the method of hyperparameters to robustly estimate the distribution of an 
unknown frequency, even in the presence of noise. 



E. Computational Cost 

Another way that we can assess the cost of inferring the Hamiltonian of a system is in terms of the classical 
computing time needed to learn the Hamiltonian parameters to within a fixed error tolerance (as measured by the 
number of likelihood calls made) . Our previous discussion found that the experimental time (measured by the number 
of experiments) can be minimized by choosing measurements that minimize the risk, and showed that increasingly 
sophisticated heuristics for generating these guesses tended to reduce the experimental time. This suggests that 
a trade-off may be present between the experimental time and the classical processing time needed to learn the 
parameter. This tradeoff will become increasingly relevant as the size of the quantum system grows, since existing 
quantum simulation techniques do not scale efficiently with the number of particles in the system and thus the cost 
of performing a likelihood call may asymptotically become much more expensive than performing an experiment. 

If computational time is of primary importance (rather than experimental time), then the relative merits of the 
experimental design heuristics changes. In total, our data sets in figures [4] and [IT] required (on average) a number 
of likelihood calls that fell within the range [1.05 x 10 7 , 1.5 x 10 9 ]. A likelihood call required the evaluation of 
exp(— t/T 2 ) cos 2 + (1 — exp(— i/T 2 ))/2, which required time on the order of 10 -7 seconds on our computers and 
lead to total computational times that were on the order of a second to a minute. If the rate at which experiments 
can be performed were much faster than 200 Hz then the utility of our algorithm as a means to speed up data 
collection may be lost. If the two rates are approximately comparable, then interesting trade-offs appear between the 
computational time needed and the total experimental time. 

These trade-offs become apparent by plotting the scaling of the MSE as a function of the computational time for 
the randomized guess heuristic in Fig. |11| The first feature that is obvious from the plot is that the strategies which 
yielded the lowest MSE per experiment tend to yield the highest MSE per likelihood call; although several of these 
strategies cause the expected loss (mean-square error) to saturate after a finite number of experiments. In particular, 
this causes the strategy with 30 guesses and no optimization as well as the strategy with 30 guesses, NCG optimization 
and approx_ratio = 0.1 to intersect the curve for the cases with NCG optimization and approx_ratio = 1. On the 
surface, this seems to indicate that the more expensive heuristics may have an advantage if small loss is desired; but 
this is misleading and to get a complete picture we need to look at more than just the expected performance of the 
strategies. 

We can get a better understanding of this saturation by looking at the plot of the 84 th percentile of the loss in 
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FIG. 11: This figure compares the mean-square error as a function of the computational time for the known T2 model with 
T2 — 100, 5000 particles, approx_ratio = 1 and guessed experimental times chosen randomly from an exponential distribution 
with mean T2. The expected loss incurred by each optimization strategy is shown is shown in the left figure and the figure on 
the right shows the 84 th percentile Qo.84 of the loss, such that no more than 16% of trials incur loss greater than the shown 
percentile. 

Fig. |4j which shows that all of these strategies continue to provide improved estimates of u> even into this regime of 
saturation for at least 84% of the trials considered. This shows that there were a few trials where very poor guesses 
were chosen and the algorithm became stuck at a large MSE. The data also suggests that the use of NCG and a large 
value of the approximation ratio can mitigate these problems, causing the learning algorithm to become more stable 
at the price of requiring more computational time. 

There are many strategies that can be employed to reduce the computational time required by our algorithm. 
Firstly, since the calculation of each guess is independent of the other guesses, this task can be trivially paralellized 
on a cluster that uses very little communication between the nodes. Additionally, because the simulations do not 
need to be high precision for the method to be successful, a single-precision implementation of the simulation step 
could also be used to improve the performance of the simulation in circumstances where the quantum simulation used 
to evaluate the likelihood function is computationally expensive. Such simulations have been demonstrated using 
graphical processing units (GPUs) [55] and field-programmable gate arrays (FPGAs) [57]; the latter is of particular 
interest due to the use of FPGA devices in the control of quantum information processing systems. 

IX. CONCLUSIONS 

Our work provides a simple algorithm that applies Bayesian inference to learn a Hamiltonian in an online fashion; 
that is to say, that our algorithm learns the Hamiltonian parameters as the experiment proceeds rather than collecting 
data and inferring the Hamiltonian through post-processing. This eliminates the need to store and process gigabytes 
of data that are recovered from even relatively short experiments. Our work has several advantages over existing 
approaches to learning Hamiltonian parameters. First, it can be used to estimate the optimal parameterization of the 
dynamics of an arbitrary quantum system within a space of model Hamiltonians. Second, it can be used to provide 
a region estimatate of the Hamiltonian parameters. The importance of this is obvious: it allows us to not only learn 
the unknown parameters but also quantify our uncertainty in them. Third, our analysis of the algorithm shows a 
clear trade off between the experimental time and the computational time needed to parameterize the Hamiltonian. 

We illustrated these advantages by benchmarking our algorithm's performance for a number of computationally 
tractable Hamiltonian models that involve a qubit precessing with an unknown frequency in the presence of decoher- 
ence (finite T2 time). Our results showed that the scaling of the mean-square error of an unknown frequency with 
the number of experiments irrespective of whether T2 is known approaches the Bayesian Cramer-Rao bound, which 
is known to be optimal, given a set of experimental designs. In contrast, other methods for learning an unknown 
frequency require a well characterized T2 time as a prerequisite. Our work, on the other hand, shows that we can 
learn the unknown frequency and the unknown Ti simultaneously, which has obvious advantages if data collection is 
slow. 
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Perhaps most importantly, our algorithm also provides region estimates of the unknown parameters of the Hamil- 
tonian. This is to say that at the end of the experiments we not only have an estimate of not only the values of the 
unknown parameters but also our uncertainty in those values. We showed that the method is capable of providing 
region estimates that contain the actual unknown parameters with probability approximately 99.7%. Our algorithm 
also was used to provide construct similar confidence intervals for cases where the unknown Hamiltonian parameters 
were not constant between experiments but instead fluctuated according to an unknown distribution. These tests 
showed that our algorithm is robust and also is valuable even if the cost of performing experiments is minimal. 

Finally, we compared the costs in terms of experimental and computational time of our algorithm. We found that 
the heuristics that reduced the experimental time most significantly often required more computational time to reach 
a desired level of accuracy. Conversely, we found that many of the computationally innexpensive heuristics failed to 
reduce the mean-square errors after a finite number of experiments. This suggests that the relative merits of different 
heuristics change as the relative costs of computational time and experimental time and the precision with which the 
unknown parameters should be estimated vary. 

An obvious extension of our work would be to consider more advanced optimization heuristics than conjugate 
gradient searches (such as particle swarm optimization algorithms). Similarly, more advanced resampling techniques 
may lead to substantial reductions in the number of particles which in turn would reduce the computational cost of 
the algorithm. Finally, estimates of how the number of experiments required to achieve a specific mean-square error 
scales with the number of unknown parameters would be an important extension of this work since it would assess 
the viability of these techniques for controlling and characterizing larger quantum systems. 
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FIG. 12: The likelihood function for N — 1, 51, 101, 151 and 201 simulated measurements at random times in in (0, 207r). The 
model is that given in equation (10 1. The red dot (and red arrow) is the randomly chosen true parameter x = (w, Tb). The 
blue dots are the n = 100 sequential Monte Carlo "particles". 



